Abstract

Character groundtruth for real, scanned document images is crucial for evaluating the
performance of OCR systems, training OCR algorithms, and validating document degradation
models. Unfortunately, manual collection of accurate groundtruth for characters in a real
(scanned) document image is not practical because (i) accuracy in delineating groundtruth
character bounding boxes is not high enough, (ii) it is extremely laborious and time
consuming, and (iii) the manual labor required for this task is prohibitively expensive.
In this paper we describe a closed-loop methodology for collecting very accurate
groundtruth for scanned documents. We first create ideal documents using a typesetting
language. Next we create the groundtruth for the ideal document. The ideal document is
then printed, photocopied and scanned. A registration algorithm estimates the global
geometric transformation and then performs a robust local bitmap match to register the
ideal document image to the scanned document image. Finally, groundtruth associated with
the ideal document image is transformed using the estimated geometric transformation to
create the groundtruth for the scanned document image. This methodology is very general
and can be used for creating groundtruth for typeset documents in any language, layout,
font, and style. We have demonstrated the method by generating groundtruth for English,
Hindi, and FAX document images. The cost of creating groundtruth using our methodology is
minimal. If character, word or zone groundtruth is available for any real document, the
registration algorithm can be used to generate the corresponding groundtruth for a
rescanned version of the document.

research by the National Science Foundation under grant EIA0130422 and the Department of
Defense under contract MDA9049-C6-1250 is gratefully acknowledged.

1 Introduction


Character groundtruth for real, scanned document images is extremely useful for
evaluating the performance of OCR systems, training OCR algorithms, and validating
document degradation models. Unfortunately, manual collection of accurate groundtruth for
characters in a real (scanned) document image is not practical because (i) accuracy in
delineating groundtruth character bounding boxes is not high enough, (ii) it is extremely
laborious and time consuming, and (iii) the manual labor required for this task is
prohibitively expensive. In fact, successful OCR companies have a large number of
employees whose only job is to manually annotate scanned documents with the groundtruth.

In this paper we present a closed-loop methodology for collecting very accurate
groundtruth for scanned documents. We first create ideal documents using a typesetting
language. Next we create the groundtruth for the ideal document. The ideal document is
then printed, photocopied and then scanned. A registration algorithm estimates the
geometric transformation that registers the ideal document image to the scanned document
image. Finally, groundtruth associated with the ideal document image is transformed
using the estimated geometric transform to create the groundtruth for the scanned
document image.

The groundtruth generated by this method, besides being directly useful for evaluating
the performance of OCR systems, is crucial for validating document degradation models
[KHB+94, KBH95]. In fact, manually collected groundtruth invariably contains many
outliers and forces the use of robust statistical techniques [KBB95]. For a more detailed
discussion please see [Kan96]. Parts of this work were reported earlier in [KB96].

We are unaware of any literature that uses a method similar to ours for automatically
collecting groundtruth. Certainly a lot of work on document registration has been
reported in the past. However, most of this literature pertains to the problem where a
fixed ideal form has to be registered to a scanned, hand-filled form. The general idea is
to extract the information filled in by a human in the various fields of the form. A
common method is to extract features from the scanned forms and match them to the
features in the ideal form [DR93, CF90]. Unfortunately we cannot use this body of work
since there are no universal landmarks that appear at fixed locations in each type of
document.

2  Document  Groundtruth 

Groundtruth information is essential for evaluating any document understanding system.
By groundtruth we mean the correct location, size, font type, and bounding box of the
individual symbols on the document image. More global groundtruth associated with a
document image could include layout information (such as zone bounding boxes demarcating
individual words, paragraphs, article and section titles, addresses, footnotes, etc.) and
style information (general information regarding number of columns, right justified or
not, running head, etc.). The groundtruth information, of course, needs to be 100 percent
accurate, otherwise the systems being evaluated will be penalized incorrectly.

Groundtruth information is invaluable for performance evaluation of OCR algorithms. Use
of completely annotated groundtruth permits us to study which factors affect the
algorithm performance the most. This in turn allows an algorithm developer to think of
ways to improve the algorithm. Among the dimensions of complete annotation are layout,
style, font type, font size, symbol location, and symbol identity.

Style: Page numbers can be printed on top or bottom; the document may or may not
have a running head; the various indentation lengths can be varied; the columns can be
justified or ragged; the number of columns can be changed. Thus by changing these
variables, we can study how robust the OCR system is with respect to these style
parameters.

Font: OCR systems can be very sensitive to the fonts used. Thus we can study the
performance of the OCR algorithm by changing the various fonts (Helvetica, Times
Roman, etc.) used in the documents, while keeping the text unchanged. Furthermore,
OCR algorithms usually have a subsystem that identifies the font in a particular
zone. Performance evaluation of such subsystems can be done if the groundtruth
information about the fonts is available.

Size: Just as some OCR systems have subsystems that identify the font types, other
OCR systems have subsystems that identify character size, which then is used by
the recognition engine. Having the bounding box information associated with each
symbol on the page will allow us to evaluate the performance of these subsystems.

Location and Identity: Finally, since the groundtruth contains the location and
identity (e.g., which character, or math symbol) of each symbol on the page, we can
use this information to evaluate the performance of the OCR system.

In our previous work [KHP93, KHP94, HP+94], we showed how very accurate groundtruth
for ideal document images can be generated. We also showed that the groundtruth for
ideal document images could be used as groundtruth for synthetically degraded
documents. The actual process was as follows. We first created noise-free document images
using the LaTeX typesetting language [Knu88, Lam86], and extracted the groundtruth
information from the DVI files generated by LaTeX. We then synthetically degraded the
ideal binary document image using a local document degradation model. The groundtruth
for the synthetically degraded documents is 100% accurate, is easily generated (a few
seconds on a SPARC 5), and does not cost anything once we have the LaTeX files.

Unfortunately, once the ideal document image is printed and scanned, the groundtruth
information associated with the ideal document image is not usable for the scanned
document image since the scanned document is geometrically transformed. That is, the
printing and scanning process, to a first order, translates, scales and rotates each
character on the document image, besides adding pixel noise. To second order, each
character is further translated by a few pixels from its nominal position after the
translation, rotation and scale.

Thus the only alternative is to manually enter the groundtruth for the real, scanned
documents. This task is extremely laborious, time consuming and prohibitively expensive.
Furthermore, the person creating the groundtruth should be knowledgeable in the
language in which the document is written.

In the following section we outline an automatic, closed-loop approach to generation
of groundtruth for real documents. This methodology is very general and is independent
of the language in which the document text is written.

3  The  groundtruth  generation  methodology:  a  closed-loop  approach 

First, the documents are typeset using LaTeX. Next these documents are converted into
binary bitmap images, which are our ideal noise-free documents. When the ideal bitmap
is generated from the DVI files, the corresponding groundtruth (location, bounding box,
font type and size, and identity of each character) is also generated.

The ideal document image is then printed and scanned. At this point, although
the groundtruth for the ideal image is known, the groundtruth for the real scanned
image is unknown. However, if the exact transformation that registers the ideal and
degraded images were known, we could compute the groundtruth for the real image
simply by transforming the bounding box coordinates of the ideal groundtruth by the
same transformation.

Thus the groundtruth creation problem now reduces to finding an appropriate
transformation that models the geometric distortions the document image undergoes when
it is printed and then scanned. Whatever the functional form of the transformation, to
estimate the parameters of the transformation we require corresponding feature points
from the ideal and real images. Thus, a rough outline of the groundtruth generation
method is:

1. Generate ideal document images and the associated character groundtruth.

2. Print the ideal documents and scan them back.

3. Find feature points $p_1, \ldots, p_n$ and $q_1, \ldots, q_n$ in the corresponding
ideal and real document images.

4. Establish the correspondence between the points $p_i$ and $q_i$.

5. Estimate the parameters of the transformation T that maps $p_i$ to $q_i$.

6. Transform the ideal groundtruth information using the estimated transformation T.

The transformation T mentioned in the procedure above is a 2D to 2D mapping.
That is, $T : \mathbb{R}^2 \to \mathbb{R}^2$. Thus, if $(x, y) = T(u, v)$, where $(u, v)$
is the ideal point and $(x, y)$ is the scanned point, $x$ in general may be a function of
both $u$ and $v$; and the same is true regarding $y$.

Generation of the ideal document image and the corresponding groundtruth is achieved
by the synthetic groundtruth generation software DVI2TIFF, which was described in
[Kan96]. The software is available with the UW English Document Database [HP+94].
Given a transformation T, transforming the groundtruth information is trivial: all that
needs to be done is transform the bounding box coordinates of the ideal groundtruth
using the transformation T. Thus, there are two main problems: finding corresponding
feature points in two document images, and finding the transformation T.
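
As an illustration of the last step, the following is a minimal sketch of transforming one
groundtruth bounding box once some transformation T is available. The function names, the
`(left, top, right, bottom)` box layout, and the choice to return the axis-aligned hull of
the four mapped corners are assumptions made for this example, not part of DVI2TIFF.

```python
def transform_box(T, box):
    """Map an ideal bounding box through T and return the enclosing box.

    T   : callable taking an ideal point (u, v) and returning a scanned point (x, y)
    box : (left, top, right, bottom) in ideal-image pixel coordinates
    """
    left, top, right, bottom = box
    corners = [(left, top), (right, top), (right, bottom), (left, bottom)]
    mapped = [T(u, v) for (u, v) in corners]
    xs = [x for x, _ in mapped]
    ys = [y for _, y in mapped]
    # Assumption: keep the result axis-aligned by taking the hull of the corners.
    return (min(xs), min(ys), max(xs), max(ys))


def example_T(u, v):
    # Hypothetical transformation: isotropic scale of 2 plus a shift of (2, 3).
    return (2 * u + 2, 2 * v + 3)


print(transform_box(example_T, (1, 1, 3, 4)))
```

Any of the transformation models discussed in the following sections could be passed in
place of `example_T`.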

4 Geometric Transformations

Suppose we are given the coordinates of feature points $p_i$ on an ideal document page,
and the coordinates of the corresponding feature points $q_i$ on the real document page.
(How these feature points are extracted and matched is described in section 6.) The
problem is to hypothesize a functional form for the transformation T that maps the ideal
feature point coordinates to the real point coordinates, and a corresponding noise model.
To ensure that the transformation T is the same throughout the area of the document page,
we choose the points $p_i$ from all over the document page.

The  possible  candidates  for  the  geometric  transformation  and  pixel  perturbation  are 
similarity,  affine,  and  projective  transformations: 

Similarity Transformation: This transformation is defined by the equation:


where $(u_i, v_i)$ is the ideal point, $(x_i, y_i)$ is the transformed point, and
$(\eta_i, \psi_i)^T \sim N(0, \sigma^2 I)$ is the noise.

In the above parameterization of the similarity transformation, $a$ represents the
product of a nonnegative isotropic scale and the cosine of the rotation angle; $b$
represents the product of the nonnegative scale and the sine of the rotation angle; and
$t_x$ and $t_y$ represent the translation in the x and y directions. This
parameterization is linear and unconstrained in the parameters, unlike the
parameterization in terms of scale, cosine and sine of rotation angle, and translation.
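
For concreteness, a similarity transformation with these four parameters can be written
out as follows. This is a sketch that assumes the counterclockwise rotation convention
(the sign of $b$ flips under the opposite convention); $s$ and $\theta$ denote the scale
and rotation angle and are names introduced here for illustration:

\[
\begin{pmatrix} x_i \\ y_i \end{pmatrix} =
\begin{pmatrix} a & -b \\ b & a \end{pmatrix}
\begin{pmatrix} u_i \\ v_i \end{pmatrix} +
\begin{pmatrix} t_x \\ t_y \end{pmatrix} +
\begin{pmatrix} \eta_i \\ \psi_i \end{pmatrix},
\qquad a = s\cos\theta, \quad b = s\sin\theta .
\]

Under this assumption the scale and angle can be recovered from the estimated parameters
as $s = \sqrt{a^2 + b^2}$ and $\theta = \operatorname{atan2}(b, a)$, which is why no
constraint such as $\cos^2\theta + \sin^2\theta = 1$ has to be imposed during estimation.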

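Affine Transformation: In this case we assume that the real image is an affine
transformation of the ideal image. The affine transformation allows for rotation,
translation, anisotropic scale, and shear. The functional form that maps the ideal point
onto the real point is:

\[
\begin{pmatrix} x_i \\ y_i \end{pmatrix} =
\begin{pmatrix} a & b \\ c & d \end{pmatrix}
\begin{pmatrix} u_i \\ v_i \end{pmatrix} +
\begin{pmatrix} e \\ f \end{pmatrix} +
\begin{pmatrix} \eta_i \\ \psi_i \end{pmatrix}
\tag{2}
\]

where $(a, b, c, d, e, f)$ are the affine transform parameters, $(u_i, v_i)$ is the ideal
point, $(x_i, y_i)$ is the transformed point, and
$(\eta_i, \psi_i)^T \sim N(0, \sigma^2 I)$ is the noise.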

Projective Transformation: Here the assumption is that the real image is a perspective
projection of an image on a plane onto another nonparallel plane. The functional form
that maps the ideal point $(u_i, v_i)$ onto the real point $(x_i, y_i)$ is

\[
\begin{pmatrix} x_i \\ y_i \end{pmatrix} =
\frac{1}{a_3 u_i + b_3 v_i + 1}
\begin{pmatrix} a_1 u_i + b_1 v_i + c_1 \\ a_2 u_i + b_2 v_i + c_2 \end{pmatrix} +
\begin{pmatrix} \eta_i \\ \psi_i \end{pmatrix}
\tag{3}
\]

where $(u_i, v_i)$ is the ideal point, $(x_i, y_i)$ is the transformed point, and
$(\eta_i, \psi_i)^T \sim N(0, \sigma^2 I)$ is the noise. After inspection it can be seen
that the equations are linear and unconstrained in the eight parameters
$a_1, b_1, c_1, a_2, b_2, c_2, a_3, b_3$. Discussion on the estimation of these
parameters can be found in section 5. This parameterization accounts for rotation,
translation and the center of perspectivity parameters.

In the above discussion $\sigma$ can be assumed to be known and a function of the spatial
quantization error and the image processing algorithm that is used to detect the feature
points.

Each of these models can be used to fit the data. Nevertheless, the question is which
model, if any, models the transformations correctly, and how does one go about proving
the hypothesis quantitatively?

In the next section, we show how to estimate the parameters of the three models. In
the section that follows we show how to statistically validate/invalidate the models.

5  Estimation  of  geometric  transformation  parameters 

Note that all three models are linear in the parameters. Each pair of corresponding
points provides two linear constraints on the parameters. Thus we need at least two
corresponding points for estimating the four similarity parameters, three corresponding
points for the six affine parameters, and four for the eight projective parameters. If we
have more corresponding points than the minimum required, we can solve for the parameters
of the transformation in a least squares sense, which also happens to be the maximum
likelihood estimate of the parameters under the Gaussian noise model.


5.1  Similarity  transformation 

The  similarity  equations  given  in  equation  (1)  can  be  rearranged  into  the  following  form: 

where $(u_i, v_i)$ is the ideal point, $(x_i, y_i)$ is the transformed point,
$(a, b, t_x, t_y)$ are the similarity transform parameters, and $(\eta_i, \psi_i)$ is the
noise. If there are n corresponding points, the above equation can be written as:

The above equation can be written in a compact matrix form as follows:

\[ b = Ap + n \tag{6} \]

where $b$ is the $2n \times 1$ vector of x and y, $A$ is the $2n \times 4$ form matrix,
$p$ is the $4 \times 1$ vector of unknown parameters, and $n$ is the $2n \times 1$ vector
of noise values. If the number of corresponding points n is two, we have four equations
in four unknowns, and thus can solve for $p$ uniquely by solving the system of equations:

\[ b = Ap. \tag{7} \]

However, if we have more than two correspondences, we can solve for $p$ in a least squares
sense by solving the following linear system of equations [Kan96]:

\[ A^{T} b = A^{T} A p. \tag{8} \]
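
The least squares step can be sketched in a few lines of standard-library Python. This
is an illustrative sketch only: the row layout of the form matrix assumes the convention
x = a·u − b·v + t_x, y = b·u + a·v + t_y (the opposite sign convention for b is equally
possible), and `solve_linear` and `estimate_similarity` are hypothetical helper names.

```python
def solve_linear(M, y):
    """Solve the square system M z = y by Gauss-Jordan elimination with pivoting."""
    n = len(M)
    aug = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(M, y)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: not enough independent points")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * c for a, c in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def estimate_similarity(ideal, real):
    """Return (a, b, tx, ty) from two or more point correspondences."""
    A, rhs = [], []
    for (u, v), (x, y) in zip(ideal, real):
        # Assumed convention: x = a*u - b*v + tx and y = b*u + a*v + ty.
        A.append([u, -v, 1.0, 0.0])
        rhs.append(x)
        A.append([v, u, 0.0, 1.0])
        rhs.append(y)
    m = 4
    # Normal equations: (A^T A) p = A^T b.
    AtA = [[sum(row[i] * row[j] for row in A) for j in range(m)] for i in range(m)]
    Atb = [sum(row[i] * value for row, value in zip(A, rhs)) for i in range(m)]
    return solve_linear(AtA, Atb)


ideal_pts = [(0, 0), (1, 0), (0, 1), (1, 1)]
real_pts = [(3, -1), (5, 0), (2, 1), (4, 2)]   # generated with a=2, b=1, tx=3, ty=-1
print(estimate_similarity(ideal_pts, real_pts))
```

With pixel coordinates in the thousands, forming the normal equations squares the
condition number, so in practice one might first shift and scale the coordinates; that
refinement is omitted here.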


5.2  Affine  transformation 

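The affine equations given in equation (2) can be rearranged into the following form:

\[
\begin{pmatrix} x_i \\ y_i \end{pmatrix} =
\begin{pmatrix} u_i & v_i & 0 & 0 & 1 & 0 \\ 0 & 0 & u_i & v_i & 0 & 1 \end{pmatrix}
\begin{pmatrix} a \\ b \\ c \\ d \\ e \\ f \end{pmatrix} +
\begin{pmatrix} \eta_i \\ \psi_i \end{pmatrix}
\tag{9}
\]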

where $(u_i, v_i)$ is the ideal point, $(x_i, y_i)$ is the transformed point,
$(a, b, c, d, e, f)$ are the affine transform parameters, and $(\eta_i, \psi_i)$ is the
noise. If there are n corresponding points, the above equation can be written as:

\[
\begin{pmatrix} x_1 \\ y_1 \\ \vdots \\ x_n \\ y_n \end{pmatrix} =
\begin{pmatrix}
u_1 & v_1 & 0 & 0 & 1 & 0 \\
0 & 0 & u_1 & v_1 & 0 & 1 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
u_n & v_n & 0 & 0 & 1 & 0 \\
0 & 0 & u_n & v_n & 0 & 1
\end{pmatrix}
\begin{pmatrix} a \\ b \\ c \\ d \\ e \\ f \end{pmatrix} +
\begin{pmatrix} \eta_1 \\ \psi_1 \\ \vdots \\ \eta_n \\ \psi_n \end{pmatrix}
\tag{10}
\]

The above equation can be written in a compact matrix form as follows:

\[ b = Ap + n \tag{11} \]

where $b$ is the $2n \times 1$ vector of x and y, $A$ is the $2n \times 6$ form matrix,
$p$ is the $6 \times 1$ vector of unknown parameters, and $n$ is the $2n \times 1$ vector
of unknown noise values. If the number of corresponding points n is three, we have six
equations in six unknowns, and thus can solve for $p$ uniquely by solving the system of
equations

\[ b = Ap. \tag{12} \]

If we have more than three correspondences, we can solve for $p$ in a least squares sense
by solving the following linear system of equations [Kan96]:

\[ A^{T} b = A^{T} A p. \tag{13} \]

5.3 Projective transformation

The projective transformation equations given in equation (3) can be rearranged in the
following form.

where $(u_i, v_i)$ is the ideal point, $(x_i, y_i)$ is the transformed point,
$(a_1, b_1, c_1, a_2, b_2, c_2, a_3, b_3)$ are the projective transform parameters, and
$(\eta_i, \psi_i)$ is the noise. If there are n corresponding points, the above equation
can be written as:

The above equation can be written in a compact matrix form as follows.

\[ b = Ap + n \tag{16} \]

where $b$ is the $2n \times 1$ vector of x and y, $A$ is the $2n \times 8$ form matrix,
$p$ is the $8 \times 1$ vector of unknown parameters, and $n$ is the $2n \times 1$ vector
of noise values. If the number of corresponding points n is four, we have eight equations
in eight unknowns, and thus can solve for $p$ uniquely by solving the following system of
equations:

\[ b = Ap. \tag{17} \]

If we have more than four correspondences, we can solve for $p$ in a least squares sense
by solving the following linear system of equations (see [Kan96]):

\[ A^{T} b = A^{T} A p. \tag{18} \]
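
For the four-point case, one way to make the linear structure explicit is to multiply
the projective equations through by the denominator $a_3 u_i + b_3 v_i + 1$, which gives
two rows of the form matrix per correspondence. The sketch below follows that route using
only the standard library; the row layout is derived here under that assumption (the
effect of the multiplication on the noise term is ignored), and the function names are
hypothetical.

```python
def solve_linear(M, y):
    """Solve the square system M z = y by Gauss-Jordan elimination with pivoting."""
    n = len(M)
    aug = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(M, y)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("singular system: points are degenerate")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * c for a, c in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def apply_projective(p, u, v):
    """Map an ideal point (u, v) with parameters (a1, b1, c1, a2, b2, c2, a3, b3)."""
    a1, b1, c1, a2, b2, c2, a3, b3 = p
    w = a3 * u + b3 * v + 1.0
    return ((a1 * u + b1 * v + c1) / w, (a2 * u + b2 * v + c2) / w)


def estimate_projective(ideal, real):
    """Solve b = A p exactly from four correspondences (eight equations)."""
    if len(ideal) != 4 or len(real) != 4:
        raise ValueError("this sketch handles exactly four correspondences")
    A, rhs = [], []
    for (u, v), (x, y) in zip(ideal, real):
        # x * (a3*u + b3*v + 1) = a1*u + b1*v + c1, and similarly for y.
        A.append([u, v, 1.0, 0.0, 0.0, 0.0, -u * x, -v * x])
        rhs.append(x)
        A.append([0.0, 0.0, 0.0, u, v, 1.0, -u * y, -v * y])
        rhs.append(y)
    return solve_linear(A, rhs)


true_p = [1.0, 0.2, 3.0, 0.1, 1.1, -2.0, 0.01, 0.02]   # made-up parameters
corners = [(0, 0), (4, 0), (4, 6), (0, 6)]
scanned = [apply_projective(true_p, u, v) for (u, v) in corners]
print(estimate_projective(corners, scanned))
```

With more than four correspondences the same rows can be stacked and the normal
equations solved instead.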


6  Finding  corresponding  feature  points 

In a document image with text, figures and mathematics, there are no universal feature
points in the interior of the document that are guaranteed to appear in each type of
document. However, most documents have a rectangular text layout, whether they are
in one-column format or in two-column format. We use the upper-left (UL), upper-right
(UR), lower-right (LR), and lower-left (LL) corners of the text area as feature points.

The four feature points, $p_1, \ldots, p_4$, are detected on the ideal image as follows
(assume the origin at the top left corner of the image and a row-column coordinate
system).

1. Compute the connected components in the image.

2. Compute the upper-left ($a_i$), upper-right ($b_i$), lower-right ($c_i$), and
lower-left ($d_i$) corners of the bounding box of each connected component.

3. Find the four feature points using the following equations:

\[
\begin{aligned}
p_1 &= \arg\min_{a_i} \, \bigl( x(a_i) + y(a_i) \bigr), \\
p_2 &= \arg\max_{b_i} \, \bigl( x(b_i) - y(b_i) \bigr), \\
p_3 &= \arg\max_{c_i} \, \bigl( x(c_i) + y(c_i) \bigr), \\
p_4 &= \arg\min_{d_i} \, \bigl( x(d_i) - y(d_i) \bigr).
\end{aligned}
\]

The above equations compute the upper-left ($p_1$), upper-right ($p_2$), lower-right
($p_3$), and lower-left ($p_4$) feature points on the ideal image.
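
The corner selection in step 3 can be sketched as follows, taking the component bounding
boxes as given. The sketch assumes x increases to the right and y increases downward,
represents each box as `(left, top, right, bottom)`, and includes an optional size filter
with hypothetical `min_size` and `max_size` parameters for discarding components that are
too small or too large to be characters.

```python
def text_corners(boxes, min_size=1, max_size=10**9):
    """Return the (UL, UR, LR, LL) feature points from component bounding boxes.

    boxes : iterable of (left, top, right, bottom), x to the right, y downward
    """
    kept = [
        (left, top, right, bottom)
        for (left, top, right, bottom) in boxes
        if min_size <= right - left <= max_size
        and min_size <= bottom - top <= max_size
    ]
    if not kept:
        raise ValueError("no bounding boxes within the size tolerance")
    # p1: upper-left corner minimizing x + y
    p1 = min(((l, t) for l, t, r, b in kept), key=lambda p: p[0] + p[1])
    # p2: upper-right corner maximizing x - y
    p2 = max(((r, t) for l, t, r, b in kept), key=lambda p: p[0] - p[1])
    # p3: lower-right corner maximizing x + y
    p3 = max(((r, b) for l, t, r, b in kept), key=lambda p: p[0] + p[1])
    # p4: lower-left corner minimizing x - y
    p4 = min(((l, b) for l, t, r, b in kept), key=lambda p: p[0] - p[1])
    return p1, p2, p3, p4


example_boxes = [(10, 10, 20, 20), (80, 12, 90, 22),
                 (10, 100, 20, 110), (80, 100, 90, 112)]
print(text_corners(example_boxes))
```

The same function can be applied unchanged to the bounding boxes of the scanned image.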

The above algorithm is also used to compute the corresponding four feature points
$q_1, \ldots, q_4$ on the real image. Since noise blobs can sometimes appear in a real
image, we check that the bounding box sizes of the components are within a specified
tolerance. Furthermore, a potential problem can arise when two bounding boxes have their
corners on a 45 degree line, because the extremum in the equations above is then not
unique. A transformation T can be estimated using the corresponding points
$p_1, \ldots, p_4$ and $q_1, \ldots, q_4$ by the methods described in section 5.

7  Registration  results  on  scanned  images 

Unfortunately, none of the three geometric transformations described in sections 4 and
5 models the transformation very accurately. That is, the real points are displaced from
the ideal transformed points in some nonlinear fashion.

To investigate further, we drew a rectangle on a blank image, and then printed and
scanned it. The opposite sides of the scanned figure were no longer equal in length. In
fact, all four sides were of different lengths, suggesting a projective transform, which
could arise because the image plane and the original document plane are not parallel to
each other. If the projective transform cannot model the transformation, the mismatch
must arise from nonlinearities in the optical and mechanical systems. Note that these
nonlinearities could be in the printer, the photocopier, the scanner, or in any
combination of the three units. We suspect that the nonlinearities in the sensor motion
account for most of the mismatch. Small perturbations around the nominal position could
be caused by spatial quantization.

In Figure 1 we show a subimage of a scanned image with the groundtruth (character
bounding boxes) overlaid. We see that there is a lot of error. This error is not
systematic over the entire page.

Figure 1: A scanned image with groundtruth overlaid. In this case a perspective transform
was used to register the ideal document image to the scanned image. It can be seen that
there is a large error in the groundtruth.

To confirm the fact that there are nonlinearities in the printing-photocopying-scanning
processes, we set up a calibration experiment and performed a statistical test to prove
the point that the similarity, affine and projective transforms alone do not model the
transformation correctly. The calibration experiment is described in the following
subsections.

7.1  The  calibration  experiment 

We now describe a controlled experiment that was conducted to confirm the fact that
the geometric transformation that occurs while printing and scanning documents is not
a similarity, affine, or perspective transformation. First we create an ideal calibration
image consisting of only '+' symbols arranged in a grid. We print this document and then
scan it back. The crosses in the ideal image are then matched to the crosses in the
scanned image. This set of corresponding points is then used to estimate the geometric
transform parameters. The sample mean and sample covariance matrix of the registration
error vectors are then computed. Since the population mean and population covariance
matrix of the error vectors can be theoretically derived, we can test whether the
theoretically derived distribution parameters are close to the experimentally gathered
sample parameters.

In  the  next  subsection  we  provide  the  details  of  the  calibration  data  gathering  process. 
In  the  subsequent  subsection  we  give  the  details  of  the  statistical  hypothesis  testing 
procedure. 

7.2 Protocol for Calibration Experiment

The ideal image for calibrating the printer-photocopier-scanner process is created as
follows. First a grid of equally spaced "+" symbols is arranged on a 3300 x 2500 binary
image. The vertical and horizontal bars of the "+" symbol are 25 pixels long and 3
pixels thick. The numbers of symbols in each row and column of the grid are 23 and 30,
respectively.

The ideal image is then printed and scanned. The intersection points of the two
bars of the "+" symbols are used as the calibration points. The calibration points are
detected by a morphological algorithm: first the image is closed with a 3 x 3 square
structuring element. Next, two images are created by opening the closed image with
vertical and horizontal structuring elements, respectively. Calibration points on the
scanned image are detected by binary-anding these two images. A connected component
algorithm is then run on the image with the detected calibration points. The centroids
of the connected components are used as the coordinates of the calibration points. The
calibration points in the ideal image are known since the ideal calibration image is
created under the experimenter's control.

To estimate the projective transform, four feature points are first detected using the
algorithm described in section 6. Next, we estimate the projective transform parameters
from the ideal and real points (correspondences are known since we order the four points
in a counterclockwise order, starting with the upper-left feature point, and assume that
the orientation of the page is unchanged). The estimated transform parameters are then
used to project all the ideal points. An exhaustive search is conducted to establish
correspondences between the projected ideal calibration points and real calibration
points. That is, for each projected ideal point, we find the closest real point, and
assume the two points match. This is done by a brute-force $O(n^2)$ algorithm. Since n is
of the order of 1000, the computation required is of the order of $10^6$ operations,
which takes approximately three seconds on a Sparc 2. Next, for each calibration point we
compute the registration error vector, which is the displacement vector between the real
calibration point and the projected calibration point. The maximum error we observe is
within ±4 pixels in each coordinate.
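
The exhaustive matching and the error vector computation can be sketched as below. The
function names and the tiny point lists are made up for illustration; the sign convention
(real point minus projected point) is an assumption of this sketch.

```python
def match_nearest(projected, real):
    """For each projected ideal point, return the index of the closest real point.

    Brute force: every projected point is compared with every real point, O(n^2).
    """
    def sq_dist(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    return [min(range(len(real)), key=lambda k: sq_dist(p, real[k]))
            for p in projected]


def error_vectors(projected, real, matches):
    """Displacement from each projected point to its matched real point."""
    return [(real[k][0] - p[0], real[k][1] - p[1])
            for p, k in zip(projected, matches)]


projected_pts = [(0, 0), (10, 0), (0, 10)]
real_pts = [(10.5, -1), (1, 2), (-0.5, 9)]
matches = match_nearest(projected_pts, real_pts)
print(matches, error_vectors(projected_pts, real_pts, matches))
```

Nearest-neighbor matching of this kind is only reliable when the registration error is
small compared with the grid spacing, which is the situation described above.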

In Figure 2(a) we show a subimage of the scanned calibration document. The detected
calibration points are shown in Figure 2(b). In Figure 2(c) the ideal calibration points
are transformed using the estimated projective transformation and overlaid on the real
calibration points. A scatter plot of the error vectors is shown in Figure 3.

7.3  Statistical  tests 

Since the estimated parameters of the models are functions of real point coordinates,
which are random variables, the estimated parameters are random variables. The
distribution of estimated parameters can be derived in terms of the assumed distribution
of the noise in the real point coordinates. To confirm that the geometric transformation
model and the noise model are valid, we test whether or not the theoretically derived
distribution of the estimated parameter vector is the same as that computed empirically.
If either the geometric transformation model or the noise model is incorrect, the test
for equality of the empirically computed distribution and the theoretically derived
distribution will not pass. Furthermore, instead of testing the distribution of the
estimated parameters, we can test the distribution of the residual error, which in turn
has a known distribution. A brief description of the hypothesis testing procedure is
given in appendix A, and the theory and software are described in detail in
[Kan96, KH95].

Figure 2: (a) A subimage of the scanned calibration document. The detected calibration
points are shown in (b). (c) The ideal calibration points are transformed using the
estimated projective transformation and overlaid on the real calibration points.

8  Dealing  with  nonlinearities 

As we saw, because of the nonlinearities, the groundtruth bounding boxes obtained for the
characters in a scanned image from the global transformation alone are not correct. Our
solution to this problem is very simple. We first transform the ideal document image
using the perspective transformation. The groundtruth associated with the ideal image is
also transformed using the estimated perspective transform parameters. Next, each
character in the perspective-transformed image is locally translated and matched (using
Hamming distance) with a same-size subimage in the scanned document image. Thus, if the
nonlinearity gave rise to a (2, 3) translation error in pixels, our template matching
process would give the best match (minimum Hamming distance) when the translation is
(2, 3). The size of the search window is decided by the calibration experiment. If the
error vectors are large, the search window has to be made large. This local search
process gives us a highly accurate groundtruth, and the potential errors are within a
pixel.
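
The local search can be sketched in a few lines of Python. This is a minimal illustration, assuming binary images stored as nested lists of 0/1 values. The function names, the image layout and the toy data are hypothetical and do not reproduce the implementation used for the experiments.

```python
def hamming(template, image, top, left):
    """Count pixels that differ between template and the same-size subimage."""
    return sum(t != image[top + r][left + c]
               for r, row in enumerate(template)
               for c, t in enumerate(row))


def best_translation(template, image, top, left, radius):
    """Try every (dx, dy) with |dx|, |dy| <= radius around the predicted
    position (top, left); return (dx, dy, distance) of the best match."""
    h, w = len(template), len(template[0])
    best = None
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            r0, c0 = top + dy, left + dx
            if r0 < 0 or c0 < 0 or r0 + h > len(image) or c0 + w > len(image[0]):
                continue  # window would fall outside the scanned image
            d = hamming(template, image, r0, c0)
            if best is None or d < best[2]:
                best = (dx, dy, d)
    return best


# Toy example: the character is predicted at (top=4, left=4) but actually
# sits 2 pixels to the right and 3 pixels down in the scanned image.
template = [[1, 0, 0],
            [1, 0, 0],
            [1, 1, 1]]
image = [[0] * 12 for _ in range(12)]
for r, row in enumerate(template):
    for c, v in enumerate(row):
        image[7 + r][6 + c] = v

print(best_translation(template, image, top=4, left=4, radius=4))  # (2, 3, 0)
```

An exhaustive search is affordable here because the window is only a few pixels wide; its cost grows with the square of `radius`.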

9  Dealing  with  outliers:  robust  regression 

At times, when two very similar characters (for example two 'i's, or one 'i' and one
'1') are close to each other, the template matching process can match the
perspective-transformed character to the wrong scanned character. This typically happens
if we use a large search window. The error translation vector associated with the
wrongly matched character will then be off. Fortunately, we have another way of
detecting such outliers. Briefly, the procedure is as follows. Once the error vectors
are computed, we can fit a multivariate function to the x and y translation errors
associated with characters in a small area of the image. It can be assumed that within
this small area the error vectors will not vary much. Thus, if we perform a robust
regression, the outlier error vectors will be immediately detected and can be corrected.
For regression we use piecewise bilinear functions, which we now describe. A discussion
of bilinear function fitting and image warping can be found in [Wol90].

Figure 3: A scatter plot of the error vectors computed between real calibration points
and projected ideal calibration points. Both axes are in pixels.

Let the function $f : \mathbb{R}^2 \rightarrow \mathbb{R}^2$ be an image-to-image
nonlinear function. We are given the ideal calibration points $p_1, \ldots, p_n$, and the
corresponding observed points $q_1, \ldots, q_n$. That is, $q_i = f(p_i) + \eta_i$. The
problem is to construct a piecewise bilinear function $\hat{f}$ that approximates $f$ in
the sense that

$$\sum_{k=1}^{n} \| q_k - \hat{f}(p_k) \|^2 \tag{19}$$

is minimized.

The piecewise bilinear function is represented as follows. First, a grid of points
$g_{i,j}$, with $i = 1, \ldots, l$ and $j = 1, \ldots, m$, is identified on the first
image. The grid points are such that the $y$-coordinates of the points along any row of
grid points are the same and the $x$-coordinates of points along any column of grid
points are the same. That is, $y(g_{i,j}) = y(g_{k,j})$ for $i, k = 1, \ldots, l$, and
$x(g_{i,j}) = x(g_{i,k})$ for $j, k = 1, \ldots, m$. Furthermore, there is a natural
ordering of the grid point coordinates: $x(g_{i,j}) < x(g_{i+1,j})$ and
$y(g_{i,j}) < y(g_{i,j+1})$. Note that the number of grid points is much less than the
number of calibration points: $l \times m \ll n$.

We represent the nonlinear function $f$ by representing the transformation on the grid
of points $g_{i,j}$. Let $g_{i,j} + \Delta g_{i,j}$ be the grid point after the function
$f$ transforms the grid point $g_{i,j}$. Let the point $p$ be within a grid cell whose
four corner grid points are: $a = g_{i,j}$, $b = g_{i+1,j}$, $c = g_{i+1,j+1}$,
$d = g_{i,j+1}$. The transformation of the point $p$ is then approximated as follows. Let

$$t = (x(p) - x(a)) / (x(b) - x(a)), \tag{20}$$

$$s = (y(p) - y(a)) / (y(d) - y(a)). \tag{21}$$

Then the point $q = f(p) + \eta$ after transformation is given by

$$q = p + (1 - t)(1 - s)\Delta a + t(1 - s)\Delta b + ts\Delta c + (1 - t)s\Delta d + \eta, \tag{22}$$

where $\Delta a = \Delta g_{i,j}$, and $\Delta b$, $\Delta c$, and $\Delta d$ are defined
similarly.

Let $a_k, b_k, c_k, d_k$ be the corner points of the grid cell within which the point
$p_k$ lies, and let $t_k$ and $s_k$ be constants calculated using equations (20) and
(21). Equation (19) can be stated as: Find $\Delta a_k, \Delta b_k, \Delta c_k,
\Delta d_k$ to minimize

$$\sum_{k=1}^{n} \| q_k - [p_k + (1 - t_k)(1 - s_k)\Delta a_k + t_k(1 - s_k)\Delta b_k + t_k s_k \Delta c_k + (1 - t_k)s_k \Delta d_k] \|^2. \tag{23}$$

In the above equation, out of the $n \times 4$ elements $\Delta a_k, \Delta b_k,
\Delta c_k, \Delta d_k$, $k = 1, \ldots, n$, only $l \times m$ elements are unique. For
example, $\Delta c_9$ and $\Delta d_{20}$ both might represent the same grid point
variation, $\Delta g_{4,5}$: $\Delta c_9 = \Delta d_{20} = \Delta g_{4,5}$. We can now
give unique labels to the grid differences, set up a system of linear equations, and
solve for the unique elements in a least squares sense.
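
The labeling and the least-squares solve described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' software: the grid is given by hypothetical lists `xs` and `ys` of column and row coordinates, every grid point is assumed to have enough calibration points in its neighboring cells for the system to be solvable, and a plain (non-robust) least-squares fit is used. The residuals of the fit can then be inspected to spot suspicious error vectors.

```python
def cell_weights(p, xs, ys):
    """Corner labels and bilinear weights of point p, as in equations (20)-(22)."""
    def locate(v, grid):
        # Index of the grid interval that contains v, clamped to the grid extent.
        k = sum(1 for g in grid if g <= v) - 1
        return max(0, min(len(grid) - 2, k))

    i, j = locate(p[0], xs), locate(p[1], ys)
    t = (p[0] - xs[i]) / (xs[i + 1] - xs[i])
    s = (p[1] - ys[j]) / (ys[j + 1] - ys[j])
    return [((i, j), (1 - t) * (1 - s)),      # corner a
            ((i + 1, j), t * (1 - s)),        # corner b
            ((i + 1, j + 1), t * s),          # corner c
            ((i, j + 1), (1 - t) * s)]        # corner d


def fit_grid_displacements(ps, qs, xs, ys):
    """Least-squares displacement (dx, dy) at every grid point, minimizing (23)."""
    m = len(ys)
    size = len(xs) * m

    def label(ij):
        return ij[0] * m + ij[1]              # one unique unknown per grid point

    # Accumulate the normal equations (A^T A) d = A^T (q - p).
    ata = [[0.0] * size for _ in range(size)]
    atr = [[0.0, 0.0] for _ in range(size)]
    for p, q in zip(ps, qs):
        weights = cell_weights(p, xs, ys)
        for ij1, w1 in weights:
            atr[label(ij1)][0] += w1 * (q[0] - p[0])
            atr[label(ij1)][1] += w1 * (q[1] - p[1])
            for ij2, w2 in weights:
                ata[label(ij1)][label(ij2)] += w1 * w2

    # Gauss-Jordan elimination with partial pivoting.
    for c in range(size):
        piv = max(range(c, size), key=lambda r: abs(ata[r][c]))
        ata[c], ata[piv] = ata[piv], ata[c]
        atr[c], atr[piv] = atr[piv], atr[c]
        for r in range(size):
            if r != c and ata[r][c] != 0.0:
                f = ata[r][c] / ata[c][c]
                ata[r] = [a - f * b for a, b in zip(ata[r], ata[c])]
                atr[r] = [a - f * b for a, b in zip(atr[r], atr[c])]

    return {(i, j): (atr[label((i, j))][0] / ata[label((i, j))][label((i, j))],
                     atr[label((i, j))][1] / ata[label((i, j))][label((i, j))])
            for i in range(len(xs)) for j in range(m)}


def residual_norms(ps, qs, xs, ys, disp):
    """Distance between each observed q and its prediction from the fitted grid."""
    out = []
    for p, q in zip(ps, qs):
        w = cell_weights(p, xs, ys)
        ex = q[0] - p[0] - sum(wt * disp[ij][0] for ij, wt in w)
        ey = q[1] - p[1] - sum(wt * disp[ij][1] for ij, wt in w)
        out.append((ex * ex + ey * ey) ** 0.5)
    return out
```

Calling `fit_grid_displacements(ps, qs, xs, ys)` returns a dictionary that maps each grid index `(i, j)` to its estimated displacement. Points whose entry in `residual_norms` is much larger than the rest are candidates for wrongly matched characters.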

10  Experimental  Protocol  and  Results 

10.1  Data  Collection 

The ideal data is a LaTeX formatted document [Knu88, Lam86]. English documents are
typeset in the IEEE Transactions style. Hindi documents in Devanagari fonts are
formatted using public domain LaTeX macros [Vel]. The ideal binary image and character
groundtruth are created using the DVI2TIFF software. The ideal document is created at
300 × 300 dots/inch resolution and the size of the binary document in pixels is
3300 × 2550. This document is printed using a SparcPrinter II. Next, the original
printed document is photocopied five times using a Xerox photocopier: once at the normal
setting, twice with darker settings, and twice with lighter settings. Finally, the five
photocopied documents are scanned using a Ricoh scanner. The scanner is set at
300 × 300 dots/inch resolution. The rest of the scanner parameters are set at normal
settings. The scanned binary image is of size 3307 × 2544.

10.2  Protocol  for  Generating  Real  Ground  Truth 

Once the real scanned documents have been gathered as described in the previous section,
we use the registration algorithm, described in section 1, (i) to transform the ideal
binary document so that it registers to the scanned document and (ii) to create the
groundtruth corresponding to the scanned document. The transformed groundtruth also
forms the groundtruth for the transformed ideal document. The local nonlinearities of
the transformation are accounted for by searching in a local neighborhood for a good
match between the ideal character symbol and the real character symbol. The local
template match window size is determined by the calibration experiment we performed
earlier. Since the maximum error in the registration is ±4 pixels, we used a window with
$-7 < \Delta x, \Delta y < 7$. The groundtruth generated by our algorithm is highly
accurate. A subimage of the scanned image with the overlaid bounding boxes is shown in
Figure 4(a). The exclusive-OR of the real scanned document and the registered ideal
document is shown in Figure 4(b). The exclusive-OR image shows that the groundtruth for
each character is centered on the character and that the differences are at the
character edges. These differences, which are due to the point spread function of the
printing and scanning processes, are what is expected. The procedure takes 2 minutes on
a SUN SPARC 5.

Figure 4: Groundtruth for real documents. (a) A subimage of a document with the
estimated bounding boxes of each character. (b) The result of the exclusive-OR between
the real document and the registered ideal document. The exclusive-OR image shows that
the groundtruth for each character is centered on the character and that the differences
are at the character edges. These differences, which are due to the point spread
function of the printing and scanning processes, are what is expected.


11  Summary 

In this paper we presented a closed-loop method for producing character groundtruth for
real document images. The method starts by generating ideal noise-free document images
using document typesetting software such as LaTeX. These binary document images are
printed, photocopied or FAXed, and then scanned. Feature points are extracted from the
ideal and the scanned document images, and their correspondences are established. We
showed that the similarity, affine and projective transformations alone cannot be used
to represent the transformation between the ideal and the scanned documents. This fact
was confirmed by using test images specially designed for calibration, and verifying
that the statistical distribution of the registration error is not what the theory
predicts. The local nonlinearities that exist can be accounted for by performing a local
template match using the ideal character as the template, and searching a small
neighborhood in the real image for the best match. The size of the local search
neighborhood is decided by the calibration experiment, which gives us the maximum
deviation that can occur between the ideal feature points, after they have been
transformed using the estimated transformation, and the feature points on the scanned
image. If character, word or zone groundtruth is available for any real document, the
registration algorithm can be used to generate the corresponding groundtruth for a
rescanned version of the document.

Figure 5: A subimage of a FAXed document with the groundtruth overlaid. Notice that the
characters in the bottom left of the image are hardly legible. Manual groundtruth for
this type of document would be prone to errors. In contrast, our software has produced
correct groundtruth without any error.

Keywords: Automatic real groundtruth, document analysis, performance evaluation,
registration, geometric transformations.

A  Statistical testing of distribution parameters

In this appendix we give a test procedure for testing the null hypothesis that the
population mean vector, $\mu$, and population covariance, $\Sigma$, are equal to a
particular vector, $\mu_0$, and matrix, $\Sigma_0$, respectively. The test statistic is
based on the sample mean and sample covariance matrix, and its null distribution is a
$\chi^2$ distribution. For a detailed discussion see [KB95, And84].

The null hypothesis $H_0$ is:

$$H_0 : \Sigma = \Sigma_0 \text{ and } \mu = \mu_0.$$

Let $\bar{x}$ and $S$ be defined as:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i,$$

$$S = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^{t},$$

where we have assumed that the data vectors $x_i$ are $p$-dimensional and the sample
size is $n$. Let $B = (n - 1)S$ and

$$\lambda = (e/n)^{pn/2} \, |B\Sigma_0^{-1}|^{n/2} \exp\left(-[\mathrm{tr}(B\Sigma_0^{-1}) + n(\bar{x} - \mu_0)^{t} \Sigma_0^{-1} (\bar{x} - \mu_0)]/2\right).$$

Then the test statistic, $T$, is:

$$T = -2 \log \lambda. \tag{24}$$

The distribution of the test statistic $T$ under a true null hypothesis is (for proofs
see [And84]):

$$T \sim \chi^2_{p(p+1)/2+p}.$$

Thus the test procedure with a significance level $\alpha$ is:

1. Compute the test statistic $t$.

2. Compute the p-value:

   $$p_{\mathrm{value}} = \mathrm{Prob}(T > t \mid T \sim \chi^2_{p(p+1)/2+p}).$$

3. If $p_{\mathrm{value}} < \alpha$, reject the null hypothesis, $H_0$.
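
The three steps above can be sketched in Python for the two-dimensional case ($p = 2$, for example x and y registration errors), where the chi-square distribution has $p(p+1)/2 + p = 5$ degrees of freedom. This is a minimal sketch using only the standard library. The function names are hypothetical, the restriction to $p = 2$ is an assumption made to keep the matrix algebra short, and the chi-square tail probability is computed with a standard recurrence rather than with the software cited in the text.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability Prob(T > x) of a chi-square with integer df."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        q, k = math.exp(-x / 2), 2
    else:
        q, k = math.erfc(math.sqrt(x / 2)), 1
    while k < df:  # Q(x; k + 2) = Q(x; k) + (x/2)^(k/2) e^(-x/2) / Gamma(k/2 + 1)
        q += (x / 2) ** (k / 2) * math.exp(-x / 2) / math.gamma(k / 2 + 1)
        k += 2
    return q


def lrt_statistic(data, mu0, sigma0):
    """T = -2 log(lambda) for H0: mu = mu0 and Sigma = Sigma0 (2-D data only)."""
    n, p = len(data), 2
    xbar = [sum(v[k] for v in data) / n for k in range(p)]
    # B = (n - 1) S is the sum of outer products of the centered data.
    b = [[sum((v[r] - xbar[r]) * (v[c] - xbar[c]) for v in data)
          for c in range(p)] for r in range(p)]
    det0 = sigma0[0][0] * sigma0[1][1] - sigma0[0][1] * sigma0[1][0]
    inv0 = [[sigma0[1][1] / det0, -sigma0[0][1] / det0],
            [-sigma0[1][0] / det0, sigma0[0][0] / det0]]
    m = [[sum(b[r][k] * inv0[k][c] for k in range(p))
          for c in range(p)] for r in range(p)]          # B Sigma0^{-1}
    det_m = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    trace_m = m[0][0] + m[1][1]
    d = [xbar[k] - mu0[k] for k in range(p)]
    maha = sum(d[r] * inv0[r][c] * d[c] for r in range(p) for c in range(p))
    return p * n * (math.log(n) - 1) - n * math.log(det_m) + trace_m + n * maha


def check_null_hypothesis(data, mu0, sigma0, alpha=0.05):
    """Return (t, p_value, reject) following steps 1 to 3."""
    t = lrt_statistic(data, mu0, sigma0)
    p_value = chi2_sf(t, df=5)   # p(p + 1)/2 + p with p = 2
    return t, p_value, p_value < alpha
```

For instance, `check_null_hypothesis(errors, [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])` tests a hypothetical list `errors` of 2-D residuals against zero mean and identity covariance.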